32 research outputs found

    A Lazy Semantics for Program Slicing

    This paper demonstrates that if a slicing algorithm is expressed denotationally, without intermediate structures, then the power of denotational semantics can be used to prove correctness. The semantics preserved by slicing algorithms, however, is non-standard. We introduce a new lazy semantics which we prove is preserved by slicing algorithms. We also demonstrate how other concepts in program dependence that are difficult or impossible to express using standard semantics, for example variable dependence, can be expressed naturally using our new lazy semantics.
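    As a point of reference for what a slice computes, a conventional syntax-directed backward slice over straight-line assignments can be sketched as follows. This is only the kind of algorithm the paper reasons about, not its denotational formulation; the statement representation and example program are invented for illustration.

```python
# A minimal backward-slicing sketch over straight-line assignments.
# Each statement is (target, list_of_used_variables).

def backward_slice(stmts, criterion):
    """Return the indices of statements needed to compute `criterion`."""
    needed = {criterion}                      # variables still required
    kept = []
    for i in range(len(stmts) - 1, -1, -1):   # walk the program backwards
        target, uses = stmts[i]
        if target in needed:
            kept.append(i)
            needed.discard(target)
            needed.update(uses)
        # statements whose targets are never needed are sliced away
    return sorted(kept)

# x = 1; y = 2; z = x + 1; w = y + z
prog = [("x", []), ("y", []), ("z", ["x"]), ("w", ["y", "z"])]
print(backward_slice(prog, "z"))  # [0, 2]: the assignment to y is sliced away
```

    Slicing on `w` instead would keep every statement, since all four assignments contribute to its value.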

    Node coarsening calculi for program slicing

    Several approaches to reverse and re-engineering are based upon program slicing. Unfortunately, for large systems, such as those which typically form the subject of reverse engineering activities, the space and time requirements of slicing can be a barrier to successful application. Faced with this problem, several authors have found it helpful to merge control flow graph (CFG) nodes, thereby improving the space and time requirements of standard slicing algorithms. The node-merging process essentially creates a 'coarser' version of the original CFG. The paper introduces a theory for defining control flow graph node coarsening calculi. The theory formalizes properties of interest when coarsening is used as a precursor to program slicing. The theory is illustrated with a case study of a coarsening calculus, which is proved to have the desired properties of sharpness and consistency.
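    One simple instance of node merging is collapsing a node into its unique successor when that successor has no other predecessors (basic-block chaining). The sketch below illustrates only that merging step, not the paper's coarsening calculus or its sharpness and consistency properties; the graph representation is invented.

```python
# Merge CFG chains: absorb a node's unique successor when that
# successor has exactly one predecessor. `succ` maps each node to
# its list of successor nodes.

def coarsen(succ):
    pred = {}
    for n, ss in succ.items():
        for s in ss:
            pred.setdefault(s, []).append(n)
    merged = dict(succ)
    changed = True
    while changed:
        changed = False
        for n in list(merged):
            ss = merged.get(n, [])
            # n has a single successor s, and s's only predecessor is n
            if len(ss) == 1 and len(pred.get(ss[0], [])) == 1 and ss[0] != n:
                s = ss[0]
                merged[n] = merged.pop(s, [])   # n absorbs s
                for t in merged[n]:             # s's successors now follow n
                    pred[t] = [n if p == s else p for p in pred[t]]
                pred.pop(s, None)
                changed = True
                break
    return merged

# a branches to b and d; b -> c is a chain, so c is absorbed into b
print(coarsen({"a": ["b", "d"], "b": ["c"], "c": [], "d": []}))
```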

    Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage

    Probabilistic record linkage has been well investigated in recent years. The Fellegi-Sunter probabilistic record linkage and its enhanced version are commonly used methods, which calculate match and non-match weights for each pair of corresponding fields of record-pairs. Bayesian network classifiers – the naive Bayes classifier and TAN – have also been successfully used here. Very recently, an extended version of TAN (called ETAN) has been developed and proved superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN in record linkage or investigated the benefits of using naturally existing hierarchical feature level information. In this work, we extend the naive Bayes classifier with such information. Finally, we apply all the methods to four datasets and estimate the F1 scores.
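    The Fellegi-Sunter weights mentioned above follow a standard formulation: with m the probability that a field agrees given the records match, and u the probability it agrees given they do not, the agreement and disagreement weights are log-likelihood ratios. The probabilities below are made-up illustration values, not estimates from the paper's datasets.

```python
import math

# Fellegi-Sunter field weights (standard log2 likelihood-ratio form).
def fs_weights(m, u):
    agree = math.log2(m / u)                 # weight when the field agrees
    disagree = math.log2((1 - m) / (1 - u))  # weight when it disagrees
    return agree, disagree

# e.g. a surname field that agrees for 90% of matches, 10% of non-matches
agree_w, disagree_w = fs_weights(m=0.9, u=0.1)
print(round(agree_w, 2), round(disagree_w, 2))  # 3.17 -3.17
```

    Summing these per-field weights over all compared fields gives a record-pair score that is thresholded into match / possible match / non-match.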

    A Probabilistic Address Parser Using Conditional Random Fields and Stochastic Regular Grammar

    Automatic semantic annotation of data from databases or the web is an important pre-process for data cleansing and record linkage. It can be used to resolve the problem of imperfect field alignment in a database or to identify comparable fields for matching records from multiple sources. The annotation process is not trivial because data values may be noisy, containing abbreviations, variations or misspellings. In particular, overlapping features usually exist in a lexicon-based approach. In this work, we present a probabilistic address parser based on linear-chain conditional random fields (CRFs), which allow more expressive token-level features compared to hidden Markov models (HMMs). In addition, we propose two general enhancement techniques to improve the performance. One takes the original semi-structure of the data into account. The other post-processes the output sequences of the parser by combining its conditional probability with a score function, which is based on a learned stochastic regular grammar (SRG) that captures segment-level dependencies. Experiments were conducted by comparing the CRF parser to an HMM parser and a semi-Markov CRF parser on two real-world datasets. The CRF parser outperformed the HMM parser and the semi-Markov CRF on both datasets in terms of classification accuracy. Leveraging the structure of the data and combining the linear-chain CRF with the SRG further improved the parser to achieve an accuracy of 97% on a postal dataset and 96% on a company dataset.
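    The post-processing step amounts to re-ranking candidate label sequences by an interpolation of the CRF's conditional log-probability and a grammar-based score. The candidates, scores and the interpolation weight `alpha` below are invented for illustration; the paper's actual score function is learned from an SRG.

```python
import math

# Re-rank n-best label sequences by combining two log-domain scores.
def rescore(candidates, alpha=0.5):
    """candidates: list of (labels, crf_logprob, srg_logscore)."""
    return max(candidates,
               key=lambda c: alpha * c[1] + (1 - alpha) * c[2])[0]

cands = [
    (["number", "street", "city"], math.log(0.50), math.log(0.10)),
    (["street", "street", "city"], math.log(0.45), math.log(0.60)),
]
# the grammar score overturns the CRF's slightly-preferred first candidate
print(rescore(cands))
```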

    Entity Search/Match in Relational Databases

    We study an entity search/match problem that requires retrieved tuples to match an input entity query. We assume the input queries are of the same type as the tuples in a materialised relational table. Existing keyword search over relational databases focuses on assembling tuples from a variety of relational tables in order to respond to a keyword query. The entity queries in this work differ from keyword queries in two ways: (i) an entity query roughly refers to an entity that contains a number of attribute values, e.g. a product entity or an address entity; (ii) there might be redundant or incorrect information in the entity queries that could lead to misinterpretations of the queries. In this paper, we propose a transformation that first converts an unstructured entity query into a multi-valued structured query, and two retrieval methods are proposed to generate a set of candidate tuples from the database. The retrieval methods essentially formulate SQL queries against the database given the multi-valued structured query. The results of a comprehensive evaluation on a large-scale database (more than 29 million tuples) and two real-world datasets showed that our methods offer a good trade-off between generating correct candidates and retrieval time compared to baseline approaches.
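    A multi-valued structured query can be turned into SQL in a straightforward way: OR across the candidate values of one attribute, AND across attributes. The table and column names and the use of parameterised LIKE predicates below are assumptions for illustration; the paper's two retrieval methods are not reproduced here.

```python
# Build a parameterised SQL query from a multi-valued structured query.
def build_sql(structured_query, table="entities"):
    """structured_query: dict mapping column -> list of candidate values."""
    clauses, params = [], []
    for column, values in structured_query.items():
        # any candidate value for this attribute may match (OR within,
        # AND across attributes)
        ors = " OR ".join(f"{column} LIKE ?" for _ in values)
        clauses.append(f"({ors})")
        params.extend(f"%{v}%" for v in values)
    sql = f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)
    return sql, params

q = {"name": ["Acme", "ACME Ltd"], "city": ["London"]}
sql, params = build_sql(q)
print(sql)     # SELECT * FROM entities WHERE (name LIKE ? OR name LIKE ?) AND (city LIKE ?)
print(params)  # ['%Acme%', '%ACME Ltd%', '%London%']
```

    Using placeholders rather than string interpolation keeps the noisy query values from breaking (or injecting into) the SQL.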

    Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

    Probabilistic record linkage is a well-established topic in the literature. Fellegi-Sunter probabilistic record linkage and its enhanced versions are commonly used methods, which calculate match and non-match weights for each pair of records. Bayesian network classifiers – the naive Bayes classifier and TAN – have also been successfully used here. Recently, an extended version of TAN (called ETAN) has been developed and proved superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using naturally existing hierarchical feature level information and parsed fields of the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. Finally, we illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (F1 score). We also show that the results can be further improved by additionally parsing the fields of these datasets.

    Empirical Study of Partitions Similarity Measures

    This paper compares four existing distance and similarity measures between partitions: the Rand Index (RI), the Adjusted Rand Index (ARI), the Variation of Information (VI) and the Normalised Variation of Information (NVI). This work investigates the ability of these partition measures to capture three predefined intuitions: the variation within randomly generated partitions, the sensitivity to small perturbations and, finally, independence from the dataset scale. The Adjusted Rand Index (ARI) was shown to perform well overall with regard to these three intuitions.
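    For concreteness, the two Rand-style measures compared above can be computed from standard pair-counting formulas over a contingency table (VI and NVI are omitted). This is a generic reference implementation, not the paper's experimental code.

```python
from math import comb
from collections import Counter

# Rand Index and Adjusted Rand Index from two flat labelings.
def rand_indices(a, b):
    n = len(a)
    contingency = Counter(zip(a, b))                     # joint cluster counts
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_i = sum(comb(c, 2) for c in Counter(a).values())  # pairs together in a
    sum_j = sum(comb(c, 2) for c in Counter(b).values())  # pairs together in b
    total = comb(n, 2)
    # RI: fraction of item pairs on which the two partitions agree
    ri = (total + 2 * sum_ij - sum_i - sum_j) / total
    # ARI: RI corrected for chance agreement
    expected = sum_i * sum_j / total
    ari = (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)
    return ri, ari

print(rand_indices([0, 0, 1, 1], [0, 0, 1, 1]))  # identical partitions: (1.0, 1.0)
```

    Unlike the raw RI, the ARI is zero in expectation for random labelings, which is one reason it behaves better under the "randomly generated partitions" intuition.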

    Packing dimensions of projections and dimension profiles

    No full text
    For E a subset of R^n and 0 ≤ m ≤ n we define a 'family of dimensions' Dim_m E, closely related to the packing dimension of E, with the property that the orthogonal projection of E onto almost all m-dimensional subspaces has packing dimension Dim_m E. In particular, the packing dimensions of almost all such projections are equal. We obtain similar results for the packing dimension of the projections of measures. We are led to think of Dim_m E for m ∈ [0, n] as a 'dimension profile' that reflects a variety of geometrical properties of E, and we characterize the dimension profiles that are obtainable in this way.
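    The central statement of the abstract can be written compactly in standard notation, with π_V denoting orthogonal projection onto a subspace V and G(n, m) the Grassmannian of m-dimensional subspaces of R^n (this restatement is a paraphrase, not quoted from the paper):

```latex
% For E \subseteq \mathbb{R}^n and 0 \le m \le n, with \dim_P denoting
% packing dimension:
\dim_P\bigl(\pi_V(E)\bigr) = \operatorname{Dim}_m E
\quad \text{for almost all } V \in G(n,m).
```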

    A non-standard semantics for program slicing and dependence analysis

    We introduce a new non-strict semantics for a simple while language. We demonstrate that this semantics allows us to give a denotational definition of variable dependence and neededness which is consistent with program slicing. Unlike other semantics used in variable dependence, our semantics is substitutive. We prove that our semantics is preserved by traditional slicing algorithms.